Recognition of table of contents for electronic library consulting
Internal identifier: 000249 ( France/Analysis ); previous: 000248; next: 000250
Authors: Abdel Belaïd [France]
Source:
- International journal on document analysis and recognition : (Print) [ 1433-2833 ] ; 2001.
French descriptors
- Pascal (Inist)
- Wicri :
- topic : Dictionnaire.
- mix :
English descriptors
- KwdEn :
Abstract
A labelling approach for the automatic recognition of tables of contents (ToC) is described in this paper. A prototype is used for the electronic consulting of scientific papers in a digital library system named Calliope. The method operates on a roughly structured ASCII file produced by OCR. Recognition proceeds by text labelling without using any a priori model. Labelling is based on part-of-speech (PoS) tagging, which is initiated by a primary labelling of text components using some specific dictionaries. Significant tags are first grouped into homogeneous classes according to their grammar categories and then reduced to canonical forms corresponding to article fields: "title" and "authors". Non-labelled tokens are assigned to one field or the other either by applying PoS correction rules or by using a structure model generated from well-detected articles. The designed prototype operates very well on different ToC layouts and character recognition qualities. Without manual intervention, a 96.3% rate of correct segmentation was obtained on 38 journals, including 2,020 articles, accompanied by a 93.0% rate of correct field extraction.
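The labelling idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: a plain dictionary lookup stands in for PoS tagging, the word lists, tag names, and the `page` field are assumptions made for the example, and the "nearest labelled neighbour" rule is a simplified substitute for the PoS correction rules and structure model that the abstract mentions.

```python
import re

# Hypothetical mini-dictionaries; the dictionaries used in the paper are not given here.
FIRST_NAMES = {"abdel", "marie", "john"}
FUNCTION_WORDS = {"of", "for", "the", "a", "an", "and", "in", "on"}

# Assumed mapping from coarse tags to article fields ("page" is an added assumption).
FIELD_OF_TAG = {"NAME": "authors", "INITIAL": "authors", "FUNC": "title", "NUM": "page"}


def primary_label(token):
    """Assign a coarse tag to one token from its surface form and the dictionaries."""
    low = token.lower().strip(".,")
    if re.fullmatch(r"\d+", low):
        return "NUM"
    if re.fullmatch(r"[A-Z]\.", token):
        return "INITIAL"
    if low in FIRST_NAMES:
        return "NAME"
    if low in FUNCTION_WORDS:
        return "FUNC"
    return "UNK"


def label_entry(line):
    """Split one ToC line into fields; unlabelled tokens join a neighbouring field."""
    tokens = line.split()
    fields = [FIELD_OF_TAG.get(primary_label(t)) for t in tokens]
    for i, field in enumerate(fields):
        if field is None:
            # Look left first (already resolved tokens), then right.
            left = next((fields[j] for j in range(i - 1, -1, -1) if fields[j]), None)
            right = next((fields[j] for j in range(i + 1, len(fields)) if fields[j]), None)
            fields[i] = left or right or "title"
    grouped = {}
    for token, field in zip(tokens, fields):
        grouped.setdefault(field, []).append(token)
    return {field: " ".join(toks) for field, toks in grouped.items()}


entry = label_entry("Recognition of table of contents A. Belaid 42")
```

Here the function words anchor the title, the initial anchors the author field, and the unlabelled tokens ("Recognition", "table", "contents", "Belaid") are absorbed by the nearest labelled neighbour. A left-first rule of this kind would fail on layouts that put the surname before the initial, which is the sort of case a learned structure model is meant to handle.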
Url: https://hal.inria.fr/inria-00100452
Affiliations:
- France
- Alsace-Champagne-Ardenne-Lorraine, Lorraine, Région Lorraine
- Nancy, Vandoeuvre-lès-Nancy
- Centre national de la recherche scientifique, Institut national de recherche en informatique et en automatique, Laboratoire lorrain de recherche en informatique et ses applications, Université de Lorraine
Links toward previous steps (curation, corpus...)
- to stream PascalFrancis, to step Corpus: 000696
- to stream PascalFrancis, to step Curation: 000096
- to stream PascalFrancis, to step Checkpoint: 000659
- to stream Main, to step Merge: 001C86
- to stream Hal, to step Corpus: 000101
- to stream Hal, to step Curation: 000101
- to stream Hal, to step Checkpoint: 000148
- to stream Main, to step Merge: 001D23
- to stream Main, to step Curation: 001B91
- to stream Main, to step Exploration: 001B91
- to stream France, to step Extraction: 000249
Links to Exploration step
Pascal:02-0010779
The document in XML format
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en" level="a">Recognition of table of contents for electronic library consulting</title>
<author><name sortKey="Belaid, A" sort="Belaid, A" uniqKey="Belaid A" first="A." last="Belaïd">Abdel Belaïd</name>
<affiliation wicri:level="3"><inist:fA14 i1="01"><s1>LORIA-CNRS Campus Scientifique, B.P. 239</s1>
<s2>54506 Vandoeuvre-lès-Nancy</s2>
<s3>FRA</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
<country>France</country>
<placeName><region type="region" nuts="2">Alsace-Champagne-Ardenne-Lorraine</region>
<region type="old region" nuts="2">Lorraine</region>
<settlement type="city">Vandoeuvre-lès-Nancy</settlement>
</placeName>
<placeName><settlement type="city">Nancy</settlement>
<region type="region" nuts="2">Alsace-Champagne-Ardenne-Lorraine</region>
<region type="region" nuts="2">Région Lorraine</region>
</placeName>
<orgName type="laboratoire" n="5">Laboratoire lorrain de recherche en informatique et ses applications</orgName>
<orgName type="university">Université de Lorraine</orgName>
<orgName type="institution">Centre national de la recherche scientifique</orgName>
<orgName type="institution">Institut national de recherche en informatique et en automatique</orgName>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">INIST</idno>
<idno type="inist">02-0010779</idno>
<date when="2001">2001</date>
<idno type="stanalyst">PASCAL 02-0010779 INIST</idno>
<idno type="RBID">Pascal:02-0010779</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000696</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000096</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000659</idno>
<idno type="wicri:doubleKey">1433-2833:2001:Belaid A:recognition:of:table</idno>
<idno type="wicri:Area/Main/Merge">001C86</idno>
<idno type="wicri:source">HAL</idno>
<idno type="RBID">Hal:inria-00100452</idno>
<idno type="url">https://hal.inria.fr/inria-00100452</idno>
<idno type="wicri:Area/Hal/Corpus">000101</idno>
<idno type="wicri:Area/Hal/Curation">000101</idno>
<idno type="wicri:Area/Hal/Checkpoint">000148</idno>
<idno type="wicri:doubleKey">1433-2833:2001:Belaid A:recognition:of:table</idno>
<idno type="wicri:Area/Main/Merge">001D23</idno>
<idno type="wicri:Area/Main/Curation">001B91</idno>
<idno type="wicri:Area/Main/Exploration">001B91</idno>
<idno type="wicri:Area/France/Extraction">000249</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a">Recognition of table of contents for electronic library consulting</title>
<author><name sortKey="Belaid, A" sort="Belaid, A" uniqKey="Belaid A" first="A." last="Belaïd">Abdel Belaïd</name>
<affiliation wicri:level="3"><inist:fA14 i1="01"><s1>LORIA-CNRS Campus Scientifique, B.P. 239</s1>
<s2>54506 Vandoeuvre-lès-Nancy</s2>
<s3>FRA</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
<country>France</country>
<placeName><region type="region" nuts="2">Alsace-Champagne-Ardenne-Lorraine</region>
<region type="old region" nuts="2">Lorraine</region>
<settlement type="city">Vandoeuvre-lès-Nancy</settlement>
</placeName>
<placeName><settlement type="city">Nancy</settlement>
<region type="region" nuts="2">Alsace-Champagne-Ardenne-Lorraine</region>
<region type="region" nuts="2">Région Lorraine</region>
</placeName>
<orgName type="laboratoire" n="5">Laboratoire lorrain de recherche en informatique et ses applications</orgName>
<orgName type="university">Université de Lorraine</orgName>
<orgName type="institution">Centre national de la recherche scientifique</orgName>
<orgName type="institution">Institut national de recherche en informatique et en automatique</orgName>
</affiliation>
</author>
</analytic>
<series><title level="j" type="main">International journal on document analysis and recognition : (Print)</title>
<title level="j" type="abbreviated">Int. j. doc. anal. recognit. : (Print)</title>
<idno type="ISSN">1433-2833</idno>
<imprint><date when="2001">2001</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt><title level="j" type="main">International journal on document analysis and recognition : (Print)</title>
<title level="j" type="abbreviated">Int. j. doc. anal. recognit. : (Print)</title>
<idno type="ISSN">1433-2833</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Automatic recognition</term>
<term>Canonical form</term>
<term>Character recognition</term>
<term>Content analysis</term>
<term>Dictionaries</term>
<term>Electronic library</term>
<term>Image segmentation</term>
<term>Layout problem</term>
<term>Optical character recognition</term>
<term>Pattern recognition</term>
<term>Text</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Bibliothèque électronique</term>
<term>Segmentation image</term>
<term>Problème agencement</term>
<term>Forme canonique</term>
<term>Dictionnaire</term>
<term>Texte</term>
<term>Reconnaissance forme</term>
<term>Reconnaissance caractère</term>
<term>Reconnaissance automatique</term>
<term>Analyse contenu</term>
<term>Reconnaissance optique caractère</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr"><term>Dictionnaire</term>
</keywords>
<keywords scheme="mix" xml:lang="fr"><term>calliope project</term>
<term>digital library</term>
<term>projet calliope</term>
<term>reconnaissance de documents</term>
<term>table de matières</term>
<term>table of contents recognition -- part-of-speech tagging -- ocr combination</term>
<term>étiquetage par partie de discours</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">A labelling approach for the automatic recognition of tables of contents (ToC) is described in this paper. A prototype is used for the electronic consulting of scientific papers in a digital library system named Calliope. The method operates on a roughly structured ASCII file produced by OCR. Recognition proceeds by text labelling without using any a priori model. Labelling is based on part-of-speech (PoS) tagging, which is initiated by a primary labelling of text components using some specific dictionaries. Significant tags are first grouped into homogeneous classes according to their grammar categories and then reduced to canonical forms corresponding to article fields: "title" and "authors". Non-labelled tokens are assigned to one field or the other either by applying PoS correction rules or by using a structure model generated from well-detected articles. The designed prototype operates very well on different ToC layouts and character recognition qualities. Without manual intervention, a 96.3% rate of correct segmentation was obtained on 38 journals, including 2,020 articles, accompanied by a 93.0% rate of correct field extraction.</div>
</front>
</TEI>
<affiliations><list><country><li>France</li>
</country>
<region><li>Alsace-Champagne-Ardenne-Lorraine</li>
<li>Lorraine</li>
<li>Région Lorraine</li>
</region>
<settlement><li>Nancy</li>
<li>Vandoeuvre-lès-Nancy</li>
</settlement>
<orgName><li>Centre national de la recherche scientifique</li>
<li>Institut national de recherche en informatique et en automatique</li>
<li>Laboratoire lorrain de recherche en informatique et ses applications</li>
<li>Université de Lorraine</li>
</orgName>
</list>
<tree><country name="France"><region name="Alsace-Champagne-Ardenne-Lorraine"><name sortKey="Belaid, A" sort="Belaid, A" uniqKey="Belaid A" first="A." last="Belaïd">Abdel Belaïd</name>
</region>
</country>
</tree>
</affiliations>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/France/Analysis
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000249 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/France/Analysis/biblio.hfd -nk 000249 | SxmlIndent | more
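Outside Dilib, a record like the one shown above can also be inspected with a general-purpose XML parser. The following is a minimal Python sketch, assuming the record text has already been obtained as a string (for example from the output of the commands above). The function name `read_idnos`, the wrapper element, and the two namespace URIs are hypothetical: the record uses the `wicri:` and `inist:` prefixes without declaring them, so the wrapper binds them to placeholder URIs only to let the parser accept the text.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs (assumptions): they only bind the undeclared
# wicri: and inist: prefixes so that the parser does not reject the record.
WRAPPER = '<wrap xmlns:wicri="urn:example:wicri" xmlns:inist="urn:example:inist">{}</wrap>'


def read_idnos(record_xml):
    """Return the <idno> values of a record as a dict mapping type -> list of values."""
    root = ET.fromstring(WRAPPER.format(record_xml))
    idnos = {}
    for el in root.iter("idno"):
        idnos.setdefault(el.get("type"), []).append((el.text or "").strip())
    return idnos


# Shortened excerpt of the record above, kept small for the example.
SAMPLE = """<record><TEI><teiHeader><fileDesc>
<titleStmt><title xml:lang="en" level="a">Recognition of table of contents for electronic library consulting</title>
<author><affiliation wicri:level="3"><country>France</country></affiliation></author></titleStmt>
<publicationStmt><idno type="RBID">Pascal:02-0010779</idno>
<idno type="RBID">Hal:inria-00100452</idno>
<idno type="wicri:Area/France/Extraction">000249</idno>
</publicationStmt></fileDesc></teiHeader></TEI></record>"""

idnos = read_idnos(SAMPLE)
```

Values are collected in lists because the same `type` can occur more than once in a record, as with the two `RBID` entries (Pascal and Hal) here.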
To link to this page within the Wicri network
{{Explor lien |wiki= Ticri/CIDE |area= OcrV1 |flux= France |étape= Analysis |type= RBID |clé= Pascal:02-0010779 |texte= Recognition of table of contents for electronic library consulting }}
This area was generated with Dilib version V0.6.32.